Description and Annotation of Biomedical Data Sets

Author

  • Jen Ferguson
Abstract

Deposition of biomedical data sets is on the rise as more scientists submit experimental data to accompany their publications. Scientists are also increasingly reusing these publicly available data sets in their own work. Despite these developments, lack of both context and metadata can create barriers to understanding and repurposing these data sets. Researchers from the Bioinformatics Core Group in the Harvard School of Public Health attempted to address this issue by assembling a team of data curators who used the open source software suite ISA tools to annotate and contextualize microarray data sets. This paper describes the workflow and software used in curating these data sets, discusses similarities and differences in the approaches of team members to the work, and suggests possible roles for librarians in similar data curation projects.

Biomedical data deposition is on the rise as more scientists make their experimental data openly available (Piwowar and Chapman 2010). This phenomenon can likely be attributed in part to increasing pressure from publishers and funding agencies to encourage and even mandate data deposition to accompany publication. In a recent survey, more than 40% of peer reviewers for the journal Science indicated that they routinely access or use the data sets that accompany publications (Science 2011). Researchers use these data sets in a variety of ways, including validation and testing of statistical models, and critical evaluation of data discussed in publications. Some works rely heavily upon this body of publicly available data sets, employing data mining for much of their investigative basis. In perhaps the best known example, Mootha and colleagues (2003) successfully identified the human genetic defect that gives rise to Leigh syndrome by first mining publicly available data. Despite these developments, lack of context and metadata can still create obstacles to understanding and reuse of data sets.
Certain types of biomedical data, such as sequence data, can be interpreted fairly simply; little additional context aside from the sequence itself is necessary to make use of the data. Gene expression microarray data, on the other hand, require a thorough understanding of the experimental context and conditions that produced them. As a result, comprehension and reuse of microarray data

*Jen Ferguson is presently the Data Services Librarian at Northeastern University, Boston, MA, USA. Correspondence to Jen Ferguson: [email protected]

Similar resources

Implementation and Optimization of Annotation and Interpretation Step of Next-Generation Sequencing Data for Non-Syndromic Autosomal Recessive Hearing Loss

Introduction: The precision and time required for analysis of data in next-generation sequencing (NGS) depend on many factors, including the tools utilized for alignment, variant calling, annotation and filtering of variants, personnel expertise in data analysis and interpretation, and computational capacity of the lab, and its optimization is a challenging task. Method: An application software...

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

Introduction to the Biomedical Linked Annotation Hackathon (BLAH) 2015 Symposium

Scientific literature is a central repository of scientific knowledge; every important scientific discovery has been published in it. As such, it has become a main target of data mining, and in particular, text mining. However, the unstructured, or covertly structured, nature of natural language texts poses a major barrier to accessing the contents of literature. The technology of literature ann...

Rough sets theory in site selection decision making for water reservoirs

Rough Sets theory is a mathematical approach for analysis of a vague description of objects presented by a well-known mathematician, Pawlak (1982, 1991). This paper explores the use of Rough Sets theory in site location investigation of buried concrete water reservoirs. Making an appropriate decision in site location can always avoid unnecessary expensive costs which is very important in constr...

Publication date: 2013